End-to-End Lyrics Recognition with Self-supervised Learning
Lyrics recognition is an important task in music processing. Although
traditional approaches such as the hybrid HMM-TDNN model achieve good
performance, studies applying end-to-end models and self-supervised learning
(SSL) are limited. In this paper, we first establish an end-to-end baseline for
lyrics recognition and then explore the performance of SSL models on the lyrics
recognition task. We evaluate a variety of upstream SSL models with different
training methods (masked reconstruction, masked prediction, autoregressive
reconstruction, and contrastive learning). Our end-to-end self-supervised
models, evaluated on the DAMP music dataset, outperform the previous
state-of-the-art (SOTA) system by 5.23% on the dev set and 2.4% on the test
set, even without a language model trained on a large corpus. Moreover, we
investigate the effect of background music on the performance of
self-supervised learning models and conclude that the SSL models cannot extract
features efficiently in the presence of background music. Finally, we study the
out-of-domain generalization ability of the SSL features considering that those
models were not trained on music datasets.
Comment: 4 pages, 2 figures, 3 tables
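As a rough illustration of the masked-reconstruction objective named above (not any specific upstream model), the sketch below masks contiguous spans of a synthetic feature sequence and scores a stand-in "reconstruction" only on the masked frames; all shapes, span sizes, and the neighbor-averaging stand-in are invented for the example:

```python
import numpy as np

def masked_reconstruction_loss(features, reconstruction, mask):
    """L1 loss computed only over masked frames, as in
    masked-reconstruction pretraining objectives."""
    masked = np.asarray(mask, dtype=bool)
    return float(np.abs(features[masked] - reconstruction[masked]).mean())

rng = np.random.default_rng(0)
T, D = 100, 40                      # frames x feature dims (e.g. log-mel)
feats = rng.normal(size=(T, D))

# Mask ~15% of frames in contiguous spans, as SSL front-ends typically do.
mask = np.zeros(T, dtype=bool)
for start in rng.choice(T - 10, size=3, replace=False):
    mask[start:start + 5] = True

# Stand-in "model": predict each masked frame from distant neighbors.
recon = feats.copy()
idx = np.where(mask)[0]
recon[idx] = 0.5 * (feats[np.clip(idx - 5, 0, T - 1)]
                    + feats[np.clip(idx + 5, 0, T - 1)])

loss = masked_reconstruction_loss(feats, recon, mask)
```

A real upstream model would be trained to drive this loss down; here the point is only that the objective scores nothing outside the masked spans.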
A New Approach to Extract Fetal Electrocardiogram Using Affine Combination of Adaptive Filters
The detection of abnormal fetal heartbeats during pregnancy is important for
monitoring the health of the fetus. While adult ECG analysis has seen major
advances in modern medicine, noninvasive fetal electrocardiography (FECG)
remains a great challenge. In this paper, we introduce a new method based on
affine combinations of adaptive filters to extract FECG signals. The affine
combination of multiple filters is able to precisely fit the reference signal,
and thus obtain more accurate FECGs. We propose a method that combines the
Least Mean Square (LMS) and Recursive Least Squares (RLS) filters, and we find
that the Combined Recursive Least Squares (CRLS) filter achieves the best
performance among all proposed combinations. In addition, we find that CRLS is
more advantageous in extracting FECG from abdominal electrocardiograms (AECG)
with a low signal-to-noise ratio (SNR). Compared with the state-of-the-art
MSF-ANC method, CRLS shows improved performance: sensitivity, accuracy and
F1 score improve by 3.58%, 2.39% and 1.36%, respectively.
Comment: 5 pages, 4 figures, 3 tables
Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts
This paper presents a novel algorithm for building an automatic speech
recognition (ASR) model with imperfect training data. Imperfectly transcribed
speech is a prevalent issue in human-annotated speech corpora, which degrades
the performance of ASR models. To address this problem, we propose Bypass
Temporal Classification (BTC) as an extension of the Connectionist Temporal
Classification (CTC) criterion. BTC explicitly encodes the uncertainties
associated with transcripts during training. This is accomplished by enhancing
the flexibility of the training graph, which is implemented as a weighted
finite-state transducer (WFST) composition. The proposed algorithm improves the
robustness and accuracy of ASR systems, particularly when working with
imprecisely transcribed speech corpora. Our implementation will be
open-sourced.
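BTC's training graph is described as a WFST-based extension of CTC; the BTC graph itself is not reproduced here. The sketch below implements only the standard CTC forward (alpha) recursion that BTC builds on, over a toy two-frame, two-symbol example small enough to check by hand:

```python
import numpy as np

def ctc_prob(probs, target, blank=0):
    """Forward (alpha) recursion of standard CTC: total probability of
    all frame-level alignments that collapse to `target`."""
    T, _ = probs.shape
    ext = [blank]                     # extended label sequence with blanks
    for c in target:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skip transition allowed between distinct non-blank labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)

# Toy example: vocabulary {0: blank, 1: 'a'}, two frames.
probs = np.array([[0.6, 0.4],
                  [0.3, 0.7]])
p = ctc_prob(probs, [1])   # paths a-a, a-blank, blank-a -> 0.82
```

BTC would relax this graph with weighted bypass arcs so that unreliable transcript tokens can be skipped or substituted during training.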
PQLM -- Multilingual Decentralized Portable Quantum Language Model for Privacy Protection
With careful manipulation, malicious agents can reverse engineer private
information encoded in pre-trained language models. Security concerns motivate
the development of quantum pre-training. In this work, we propose a highly
portable quantum language model (PQLM) that can easily transmit information to
downstream tasks on classical machines. The framework consists of a cloud PQLM
built with random Variational Quantum Classifiers (VQC) and local models for
downstream applications. We demonstrate the ad hoc portability of the quantum
model by extracting only the word embeddings and effectively applying them to
downstream tasks on classical machines. Our PQLM exhibits comparable
performance to its classical counterpart on both intrinsic evaluation (loss,
perplexity) and extrinsic evaluation (multilingual sentiment analysis accuracy)
metrics. We also perform ablation studies on the factors affecting PQLM
performance to analyze model stability. Our work establishes a theoretical
foundation for a portable quantum pre-trained language model that could be
trained on private data and made available for public use with privacy
protection guarantees.
Comment: 5 pages, 3 figures, 3 tables
Unidirectional brain-computer interface: Artificial neural network encoding natural images to fMRI response in the visual cortex
While significant advancements in artificial intelligence (AI) have catalyzed
progress across various domains, its full potential in understanding visual
perception remains underexplored. We propose an artificial neural network
dubbed VISION, an acronym for "Visual Interface System for Imaging Output of
Neural activity," to mimic the human brain and show how it can foster
neuroscientific inquiries. Using visual and contextual inputs, this multimodal
model predicts the brain's functional magnetic resonance imaging (fMRI) scan
response to natural images. VISION successfully predicts human hemodynamic
responses as fMRI voxel values to visual inputs with an accuracy exceeding
state-of-the-art performance by 45%. We further probe the trained networks to
reveal representational biases in different visual areas, generate
experimentally testable hypotheses, and formulate an interpretable metric to
associate these hypotheses with cortical functions. With both a model and
evaluation metric, the cost and time burdens associated with designing and
implementing functional analysis on the visual cortex could be reduced. Our
work suggests that the evolution of computational models may shed light on our
fundamental understanding of the visual cortex and provide a viable approach
toward reliable brain-machine interfaces.
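VISION itself is a multimodal neural network; as a rough illustration of the general encoding-model setup (predicting per-voxel fMRI responses from stimulus features and scoring held-out correlation), here is a minimal linear ridge-regression sketch on synthetic data. The dimensions, noise level, and features are all invented for the example:

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X'X + alpha*I)^-1 X'Y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(D), X.T @ Y)

rng = np.random.default_rng(1)
N, D, V = 200, 20, 50                        # stimuli, feature dims, voxels
X = rng.normal(size=(N, D))                  # stimulus (+context) features
W_true = rng.normal(size=(D, V))
Y = X @ W_true + 0.1 * rng.normal(size=(N, V))   # simulated voxel responses

W = fit_ridge(X[:150], Y[:150], alpha=1.0)   # fit on 150 training stimuli
pred = X[150:] @ W                           # predict the 50 held-out ones
# Per-voxel prediction accuracy: correlation with held-out responses.
r = [np.corrcoef(pred[:, v], Y[150:, v])[0, 1] for v in range(V)]
```

Probing which feature dimensions carry weight for which voxels is the kind of analysis the abstract's interpretable metric formalizes.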
Investigating model performance in language identification: beyond simple error statistics
Language development experts need tools that can automatically identify
languages from fluent, conversational speech, and provide reliable estimates of
usage rates at the level of an individual recording. However, language
identification systems are typically evaluated on metrics such as equal error
rate and balanced accuracy, applied at the level of an entire speech corpus.
These overview metrics do not provide information about model performance at
the level of individual speakers, recordings, or units of speech with different
linguistic characteristics. Overview statistics may therefore mask systematic
errors in model performance for some subsets of the data, and consequently,
have worse performance on data derived from some subsets of human speakers,
creating a kind of algorithmic bias. In the current paper, we investigate how
well a number of language identification systems perform on individual
recordings and speech units with different linguistic properties in the MERLIon
CCS Challenge. The Challenge dataset features accented English-Mandarin
code-switched child-directed speech.
Comment: Accepted to Interspeech 2023, 5 pages, 5 figures
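The gap between corpus-level and per-recording evaluation can be shown with a toy example: a classifier that is perfect on one recording but poor on another still posts a respectable corpus-level balanced accuracy. All data below are invented for illustration:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls (the corpus-level 'overview' metric)."""
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c)
                          for c in classes]))

# Toy corpus: two recordings; labels 0 = English, 1 = Mandarin.
y_true = np.array([0] * 50 + [1] * 50 + [0] * 50 + [1] * 50)
rec_id = np.array([0] * 100 + [1] * 100)

# A model that is perfect on recording 0 but poor on recording 1.
y_pred = y_true.copy()
y_pred[100:150] = 1        # recording 1: its English tokens misidentified

overall = balanced_accuracy(y_true, y_pred)          # 0.75 overall
per_rec = {r: float(np.mean(y_pred[rec_id == r] == y_true[rec_id == r]))
           for r in np.unique(rec_id)}               # {0: 1.0, 1: 0.5}
```

The 0.75 overview figure hides that every error falls on one recording, which is exactly the per-speaker, per-recording breakdown the paper argues for.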